Unsupervised-learning-based method for chest MRI–CT transformation using structure constrained unsupervised generative attention networks

The integrated positron emission tomography/magnetic resonance imaging (PET/MRI) scanner simultaneously acquires metabolic information via PET and morphological information via MRI. However, attenuation correction, which is necessary for quantitative PET evaluation, is difficult because it requires the generation of attenuation-correction maps from MRI, which has no direct relationship with gamma-ray attenuation. MRI-based bone tissue segmentation is potentially usable for attenuation correction in relatively rigid and fixed regions such as the head and pelvis. However, it is challenging for the chest region because of respiratory and cardiac motion, the anatomically complicated structure of the chest, and the thin bone cortex. We propose a new method using unsupervised generative attentional networks with adaptive layer-instance normalisation for image-to-image translation (U-GAT-IT), which specialises in unpaired image transformation based on attention maps. We added the modality-independent neighbourhood descriptor (MIND) to the loss of U-GAT-IT to guarantee anatomical consistency in the image transformation between different domains. Our proposed method generated synthesised computed tomography (CT) images of the chest. Experimental results showed that our method outperforms current approaches. The study findings suggest the possibility of synthesising clinically acceptable CT images from chest MRI with minimal changes in anatomical structures and without human annotation.


Materials and methods
In this study, CycleGAN and U-GAT-IT were used to perform MRI-CT conversion using unpaired data. In addition, we applied MIND, which was proposed in a previous study, to these networks to prevent misalignment between MRI and synthesised CT images. Please refer to Fig. 1 for an outline of the proposed U-GAT-IT + MIND process.
To assess the performance of U-GAT-IT + MIND, we also evaluated CycleGAN alone, U-GAT-IT alone, and CycleGAN + MIND.
CycleGAN. CycleGAN, developed in 2017, is a method that allows transformations between two different image domains. CycleGAN consists of competing networks: an image generator (generator) and an adversarial network (discriminator) that attempts to distinguish the generated synthetic image from the real image. Taking the transformation between MRI and CT images as an example, there is a loss (G loss) that drives the generator to make the synthesised CT image closer to the real CT image, and a loss (D loss) that drives the discriminator to distinguish the synthesised CT image from the real CT image. In addition, there are two further types of losses in CycleGAN: cycle loss and identity loss. Cycle loss is the difference between the original MRI and the doubly synthesised MRI, which is synthesised in turn from the synthesised CT generated from that MRI. Identity loss is the difference between the output image and the input image (CT image and synthesised CT image) when a CT image is input to the CT generator. The same four types of losses are calculated for CT-MRI conversion (when synthesised MRI is generated from real CT). Please refer to the original paper for the conceptual diagram. These losses are given in Eqs. (1)-(4).
Generator and discriminator losses (Eq. (1)): the generator and discriminator losses are employed to match the distribution of the translated images to the distribution of the target images. Cycle loss (Eq. (2)): to alleviate the mode-collapse problem, a cycle-consistency constraint is applied to the generators. Identity loss (Eq. (3)): to ensure that the distributions of the input and output images are similar, an identity-consistency constraint is applied to the generators. Sum of losses (Eq. (4)): finally, we jointly trained the generators and discriminators to optimise the final objective.

Figure 1. Outline of the proposed U-GAT-IT + MIND process (G, D, and η denote the generator, discriminator, and auxiliary classifier, respectively). In addition to the cycle loss, which is a comparison within the same domain after two rounds of transformation, we introduce the MIND loss, which is a comparison between different domains after one round of transformation.

In these equations, I_CT denotes the CT image, I_MRI denotes the MRI image, G_CT→MRI denotes the generator that generates MRI from CT, G_MRI→CT denotes the generator that generates CT from MRI, D_CT denotes the discriminator that distinguishes G_MRI→CT(mri) from ct, D_MRI denotes the discriminator that distinguishes G_CT→MRI(ct) from mri, and L_GAN denotes the loss that includes the G loss and D loss. L_cyc is the cycle loss, and L_identity is the identity loss. λ1 and λ2 denote the coefficients of the losses. Finally, the model was trained by minimising the total loss in Eq. (5).

U-GAT-IT. U-GAT-IT is an unsupervised generative attentional network with adaptive layer-instance normalisation for image-to-image translation, developed in 2019. Similar to CycleGAN, U-GAT-IT uses the encoder-decoder method for image generation, but it incorporates an attention module in both the generator and the discriminator and combines it with an adaptive layer-instance normalisation function (AdaLIN) to focus on the more important parts of the image. AdaLIN is a normalisation method introduced together with U-GAT-IT. It adaptively selects the ratio between the commonly used layer normalisation and instance normalisation, the latter of which is known to be more effective in removing style changes. Combined with the attention-guided module, AdaLIN enables flexible control of the amount of change in shape and texture 20.
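As a rough illustration of the AdaLIN mechanism described above, a minimal PyTorch sketch is given below. It only shows the core idea of mixing instance and layer normalisation through a learnable ratio ρ; the initial value of ρ, the clipping, and the way γ and β are supplied are simplifying assumptions and may differ from the published U-GAT-IT implementation.

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Minimal sketch of adaptive layer-instance normalisation (AdaLIN).

    A learnable ratio `rho` mixes instance normalisation (per-channel,
    per-sample statistics) with layer normalisation (per-sample statistics
    over all channels); `gamma` and `beta` are supplied externally, e.g.
    from fully connected layers fed by the attention features.
    """
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Initial value of rho is an assumption; it is clipped to [0, 1].
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: mean/var over the spatial dimensions only.
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer-norm statistics: mean/var over channels and spatial dimensions.
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        # Adaptive mixture of the two normalisations, then affine modulation.
        rho = self.rho.clamp(0.0, 1.0)
        x_hat = rho * x_in + (1.0 - rho) * x_ln
        return x_hat * gamma.view(x.size(0), -1, 1, 1) + beta.view(x.size(0), -1, 1, 1)
```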
Generator and discriminator losses: as in CycleGAN, the generator and discriminator losses are employed to match the distribution of the translated images to the distribution of the target images. CAM loss represents the loss that is important for the conversion between MRI and CT, based on the information of the auxiliary classifiers η_MRI and η_CT.

CAM losses: by exploiting the information from the auxiliary classifiers η_CT, η_MRI, η_{D_CT}, and η_{D_MRI}, given an image from I_CT or I_MRI, G_MRI→CT and D_CT identify where they need to improve, i.e. what makes the largest difference between the two domains.

Sum of losses: finally, we jointly trained the encoders, decoders, discriminators, and auxiliary classifiers to optimise the final objective, in which L′_GAN denotes the loss that includes the G loss and D loss, L_cam^{G_MRI→CT} is the CAM loss of G_MRI→CT, L_cam^{D_CT} is the CAM loss of D_CT, L_cyc is the cycle loss, and L_identity is the identity loss. λ1, λ2, and λ3 denote the coefficients of the losses. Finally, the model was trained by minimising this total loss.

MIND. The modality-independent neighbourhood descriptor (MIND) characterises the local image structure around each pixel. In its definition, I denotes the image, n denotes the normalisation constant (so that the maximum value equals 1), and r ∈ R defines the region to be calculated; D_p(I, x, x + r) denotes the distance metric between the positions x and x + r and is expressed by Eq. (12). In this study, we considered r = 9. The calculations were performed by convolution, as in previous studies 30. P represents the collection of shifts applied to the image; in this case, there are 81 shifts, from −4 to 4 along the X- and Y-axis directions.

MIND loss. By calculating MIND on two different images and taking the difference between the results, MIND can be used as a loss that constrains the change in position between them. For CycleGAN and U-GAT-IT, the difference between the MIND of the image before conversion and that of the image after conversion is used as the loss. The MIND loss is represented by Eq. (14), in which I_MIND(CT, r) is the result of applying MIND to a CT image pixel by pixel.
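To illustrate how a MIND-style descriptor and the MIND loss can be computed by convolution, as described above, a minimal PyTorch sketch is shown below. The 81 shifts from −4 to 4 follow the description in the text; the box-filter patch aggregation, the local variance estimate inside the exponential, and the function names are illustrative assumptions that may differ from the implementation used in this study.

```python
import torch
import torch.nn.functional as F

def mind_descriptor(img, patch_size=9, search_radius=4, eps=1e-6):
    """Simplified MIND descriptor for a batch of 2-D images (B, 1, H, W).

    For every shift r in the (2*search_radius + 1)^2 search region, a
    patch-wise squared distance D_p(I, x, x + r) is computed by convolving
    the pixel-wise squared difference with a box kernel, converted to
    exp(-D_p / V), and normalised so the maximum over shifts equals 1.
    """
    shifts = range(-search_radius, search_radius + 1)
    pad = search_radius
    padded = F.pad(img, (pad, pad, pad, pad), mode='replicate')
    dists = []
    for dy in shifts:
        for dx in shifts:
            shifted = padded[:, :, pad + dy:pad + dy + img.shape[2],
                                   pad + dx:pad + dx + img.shape[3]]
            diff2 = (img - shifted) ** 2
            # Patch aggregation by convolution (box filter of size patch_size).
            dists.append(F.avg_pool2d(diff2, patch_size, stride=1,
                                      padding=patch_size // 2))
    dist = torch.cat(dists, dim=1)                     # (B, 81, H, W)
    variance = dist.mean(dim=1, keepdim=True) + eps    # local variance estimate
    mind = torch.exp(-dist / variance)
    return mind / (mind.max(dim=1, keepdim=True).values + eps)

def mind_loss(real, synthetic):
    """L1 difference between the MIND descriptors of two images."""
    return F.l1_loss(mind_descriptor(real), mind_descriptor(synthetic))
```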
By incorporating L_MIND into the losses of CycleGAN and U-GAT-IT, a constraint can be placed on changes in structure during conversion. The losses of CycleGAN and U-GAT-IT with the addition of MIND are expressed by Eqs. (15) and (16), where λ_MIND denotes the coefficient of the MIND loss. Finally, the models were trained by minimising the losses in Eqs. (17) and (18).
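Building on the mind_loss sketch above, the following fragment illustrates, in the spirit of Eqs. (15) and (16), how the MIND term might be added to the existing generator objective with a coefficient λ_MIND; base_losses stands in for the adversarial, cycle, identity (and CAM) terms already computed by the CycleGAN or U-GAT-IT code.

```python
def total_generator_loss(base_losses, mri, synth_ct, ct, synth_mri, lambda_mind):
    """Add the structural MIND term to the existing generator losses.

    base_losses: sum of the adversarial, cycle, identity (and CAM) terms.
    lambda_mind: coefficient of the MIND loss (20 for CycleGAN + MIND and
    5000 for U-GAT-IT + MIND in this study).
    """
    # Cross-domain structural constraint: each synthesised image should keep
    # the MIND descriptor of the image it was generated from.
    structure = mind_loss(mri, synth_ct) + mind_loss(ct, synth_mri)
    return base_losses + lambda_mind * structure
```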

Experiments
This study conformed to the Declaration of Helsinki and the Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf). This study was approved by the Ethics Committee at Kobe University Graduate School of Medicine (approval number: 170032) and was carried out according to the guidelines of the committee. The Ethics Committee at Kobe University Graduate School of Medicine waived the need for informed consent.
In-phase ZTE acquisition on PET/MRI. All PET/MRI examinations (n = 150; mean age, 65.9 ± 13.0 years; range, 19 to 90 years) were performed on an integrated PET/MRI scanner (SIGNA PET/MR, GE Healthcare, Waukesha, WI, USA) at 3.0 T magnetic field strength. MR imaging of the thoracic bed position was performed with the ZTE sequence and was acquired simultaneously with the PET emission scan. No contrast-enhancing material was used. Free-breathing ZTE was acquired with three-dimensional (3D) centre-out radial sampling to provide an isotropic resolution of 2 mm, a large field of view (FOV) of 50 cm, and a minimal TE of zero, with the following parameters: TR, ~1.4 ms; FA, 1°; 250,000 radial centre-out spokes; matrix size, 250 × 250; FOV, 50.0 cm; resolution, 2 mm; number of spokes per segment, 512; and approximate acquisition time, 5 min. To minimise fat-water chemical shift effects (i.e. destructive interference at fat-water tissue boundaries), a high imaging bandwidth of ± 62.5 kHz was used. Furthermore, the imaging centre frequency was adjusted to lie between fat and water, resulting in clean in-phase ZTE images with uniform soft-tissue signal response and minimal fat-water interference disturbances 33,34. The training data of ZTE and CT were acquired from different patients (unpaired datasets); however, ZTE and CT were performed in the same body position (arms down) on the respective scanners. CT was acquired during shallow expiratory breath-holding for attenuation correction of PET and acquisition of anatomical details with the following parameters: X-ray tube peak voltage, 120 kVp; tube current, 20 mA; section thickness, 3.27 mm; reconstructed diameter, 500 mm; reconstruction kernel, soft.
Dataset splitting. Data of thirty cases (20%) were used as the validation dataset, and data of the remaining 120 cases (80%) were used as the training dataset. For each case, unpaired CT and ZTE were used, and no manual annotations were performed.
Image postprocessing. ZTE images were semi-automatically processed to remove the background signals by using a thresholding and filling-in technique on a commercially available workstation (Advantage workstation, GE Healthcare) and converted into a matrix size of 640 × 400. To correct the variations in sensitivity and normalise the images of ZTE to the median tissue value, the nonparametric N4ITK method was applied 35,36 . CT images were also modified to remove the scanner beds on the workstation and were converted into the same matrix size. The MRI was maintained at the window width and window level stored in DICOM images, whereas the CT image was adjusted to a window width of 2000 Hounsfield Unit (HU) and a window level of 350 HU. The CT images were then scaled down to an image resolution of 256 × 256 pixels owing to GPU memory limitations.
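For readers who want to reproduce comparable preprocessing, the sketch below shows an N4 bias-field correction and a CT windowing step, assuming SimpleITK and NumPy. The actual processing in this study was performed on a commercial workstation, so the mask generation and parameter choices here are assumptions rather than the exact pipeline.

```python
import numpy as np
import SimpleITK as sitk

def n4_bias_correction(zte_image: sitk.Image) -> sitk.Image:
    """Nonparametric N4ITK bias-field correction of a ZTE volume."""
    mask = sitk.OtsuThreshold(zte_image, 0, 1, 200)  # rough body mask (assumption)
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    return corrector.Execute(sitk.Cast(zte_image, sitk.sitkFloat32), mask)

def window_ct(ct_hu: np.ndarray, width: float = 2000.0, level: float = 350.0) -> np.ndarray:
    """Clip CT values to a window (WW 2000 HU / WL 350 HU) and scale to [0, 1]."""
    low, high = level - width / 2.0, level + width / 2.0
    return (np.clip(ct_hu, low, high) - low) / width
```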
Model training. All processing was performed using a workstation (CPU: Core i7-9800X at 3.80 GHz, RAM GB, GPU: TITAN RTX) in all cases of CycleGAN, CycleGAN + MIND, U-GAT-IT, and U-GAT-IT + MIND.
CycleGAN/CycleGAN + MIND. We used a program based on the PyTorch implementation of CycleGAN 37, which was modified for DICOM images and MIND calculations. We used values of 10, 0.5, and 20 for λ1, λ2, and λ_MIND, respectively, in CycleGAN + MIND, with Adam as the optimiser and a learning rate of 0.0002 for up to 1000 epochs. A radiologist (4 years of experience) visually evaluated the results when the loss reached equilibrium. If no corruption of the synthesised CT was confirmed for the training and validation datasets, the trained network was used for the main visual evaluation described below. Except for λ_MIND, the hyperparameters of CycleGAN and CycleGAN + MIND were the same.
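A minimal sketch of the optimiser configuration with the hyperparameters stated above is given below; the network modules are trivial placeholders rather than the actual CycleGAN generators and discriminators.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder networks standing in for the CycleGAN generators/discriminators.
G_mri2ct, G_ct2mri = nn.Conv2d(1, 1, 3, padding=1), nn.Conv2d(1, 1, 3, padding=1)
D_ct, D_mri = nn.Conv2d(1, 1, 3, padding=1), nn.Conv2d(1, 1, 3, padding=1)

# Coefficients reported for CycleGAN + MIND (lambda_1, lambda_2, lambda_MIND).
lambda_cyc, lambda_idt, lambda_mind = 10.0, 0.5, 20.0

# Adam optimisers with learning rate 0.0002, trained for up to 1000 epochs.
opt_G = torch.optim.Adam(itertools.chain(G_mri2ct.parameters(), G_ct2mri.parameters()), lr=2e-4)
opt_D = torch.optim.Adam(itertools.chain(D_ct.parameters(), D_mri.parameters()), lr=2e-4)
```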

U-GAT-IT/U-GAT-IT + MIND. We used a program based on the PyTorch implementation of U-GAT-IT 20, which was modified for DICOM images and MIND calculations. We used values of 100, 100, 100, and 5000 for λ1, λ2, λ3, and λ_MIND, respectively, in U-GAT-IT + MIND, with Adam as the optimiser and a learning rate of 0.0001 for up to 100 epochs. The results obtained when the loss reached equilibrium and the radiologist visually confirmed that the synthesised images from the training data were not corrupted were used for evaluation. Except for λ_MIND, the hyperparameters of U-GAT-IT and U-GAT-IT + MIND were the same.
Visual evaluation. Twenty-one cases of chest ZTE not used for the training and validation datasets were prepared as the test dataset. The test dataset did not contain any CT images. The synthesised CTs were calculated using CycleGAN, CycleGAN + MIND, U-GAT-IT, and U-GAT-IT + MIND from axial cross-sectional ZTE images at the levels of the supraclavicular fossa, central humeral head, sternoclavicular joint, aortic arch, tracheal bifurcation, and right pulmonary vein in each case. Because the main purpose of this study was the application to PET/MRI attenuation-correction maps, it was particularly important to suppress differences in anatomical structure during the conversion. For this purpose, four radiologists evaluated the synthesised CT visually, as described below. Before the evaluation by the four radiologists, a radiologist (15 years of experience) reviewed the synthesised CT images, and almost all of them were rated as CT-like for CycleGAN, CycleGAN + MIND, U-GAT-IT, and U-GAT-IT + MIND.
Evaluation of image misalignment after conversion. Visual evaluation was performed by four radiologists (Drs A, B, C, and D, with 4, 22, 15, and 4 years of experience, respectively). The alignment of bone structures between the synthesised CTs and the original ZTE images was visually evaluated. When a relatively large defect, large displacement, or large deformation of the shape of the bone structures was observed, it was classified as a major misalignment. When a relatively small defect, small displacement, or small deformation of the shape of the bone structures was observed, it was classified as a minor misalignment. One point was given when a total of 10 or more major misalignments were found in the six images; two points when a total of five or more major misalignments were found; three points when a total of three or more major misalignments or 15 or more minor misalignments were found; four points when a total of one or more major misalignments or 10 or more minor misalignments were found; and five points otherwise.
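The scoring rule above can be summarised by the following sketch, assuming major and minor are the total counts of major and minor misalignments over the six evaluated images of one case.

```python
def misalignment_score(major: int, minor: int) -> int:
    """Visual-evaluation score from the total number of major and minor
    misalignments counted across the six images of one case."""
    if major >= 10:
        return 1
    if major >= 5:
        return 2
    if major >= 3 or minor >= 15:
        return 3
    if major >= 1 or minor >= 10:
        return 4
    return 5
```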
Statistics. The visual evaluation scores of the four radiologists were compared between U-GAT-IT + MIND and each of the other groups using the Wilcoxon signed-rank test. The Bonferroni method was used to correct for multiple comparisons, and statistical significance was set at p < 0.001.
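A minimal sketch of this comparison is given below, assuming SciPy; the score arrays are illustrative placeholders only, and the three comparisons correspond to U-GAT-IT + MIND versus each of the other methods.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired visual-evaluation scores for the same cases (illustrative values only).
scores_ugatit_mind = np.array([5, 5, 4, 5, 3, 5, 4, 5])
scores_cyclegan    = np.array([3, 2, 3, 4, 2, 3, 3, 4])

stat, p = wilcoxon(scores_ugatit_mind, scores_cyclegan)

# Bonferroni correction for the three comparisons against U-GAT-IT + MIND.
n_comparisons = 3
print(f'raw p = {p:.4g}, Bonferroni-corrected p = {min(p * n_comparisons, 1.0):.4g}')
```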

Results
Synthesised CT. In Figs. 3 and 4, the top images are MRI images, and those in the second to fifth rows are the synthesised CTs. Figures 3b and 4b show the fused images obtained from MRI and synthesised CT; the original ZTE images are displayed in grey, and the synthesised CT images in colour. Figures 3c and 4c show cropped images around the humerus in the original MR images and in the fused images obtained from MR and synthesised CT. The displacement between the original ZTE images and the synthesised CT, especially at the body contour and in the bone area, is improved by U-GAT-IT + MIND. Figure 5 shows 3D volume-rendered (VR) bone images of the front and side views composited from the synthesised CT. In general, it is extremely difficult or impossible to synthesise such VR bone images from MR images.
The upper row of Fig. 6 shows the synthesised CT based on the combination of the proposed method and conventional four-tissue segmentation, and the lower row shows the synthesised CT based on conventional four-tissue segmentation. The lower row images are clinically used for attenuation correction of PET/MRI. The upper row shows bone structures, which could not be synthesised using the conventional synthetic method (the lower row).
Figure 6. The upper row shows the combination of U-GAT-IT + MIND and conventional four-tissue segmentation; the lower row shows the synthesised CT based on the conventional four-tissue segmentation.

CycleGAN + MIND with high coefficient of MIND loss. The larger the coefficient of the MIND loss in CycleGAN + MIND, the more collapsed the synthesised CT became, with distortion of its contrast. Figure 12 shows the synthesised CTs from CycleGAN + MIND with a high coefficient of MIND loss (λ_MIND = 60), which were apparently different from a normal CT. Thus, the coefficient of the MIND loss in CycleGAN + MIND could not be increased to the same value as that used in U-GAT-IT + MIND.

Discussion
The combination of U-GAT-IT and MIND can help in image conversion between MRI and CT images with smaller misregistration than conventional unpaired image transfer (CycleGAN) using unpaired datasets. The generation of paired datasets for training is simple for the head, neck, and pelvis regions because changes in body position and deformation of organs between different scans are anatomically small, which allows simple non-rigid registrations to align the paired data in these regions. In the chest region, however, manual annotations or registrations are required to generate paired datasets, which makes the process extremely time-consuming; furthermore, such models lack robustness because of the anatomically complicated structures of the chest and the significant changes and deformation of the images between scans caused by different respiratory motions, scanner-bed shapes, and body positions. These issues were the strong motivation to develop an unsupervised method for image conversion with unpaired datasets in this study. In this work, we also tried the combination of CycleGAN and MIND; however, the generated images were apparently different in contrast from a normal CT when the coefficient of the MIND loss (λ_MIND) was increased to the same range as that used for U-GAT-IT + MIND. This suggests that the CAM loss introduced in U-GAT-IT prevents inconsistencies caused by the increase in the MIND loss. When the coefficients of the CAM loss were reduced without changing the other coefficients, the generated images did not appear CT-like in contrast, suggesting that the effect of the CAM loss on the conversion between images was important.
There are some limitations to our study. First, we did not evaluate the effect of the synthesised CT on PET accumulation (e.g., changes in SUV); further studies are required to evaluate this effect. Second, our study was conducted with a single PET/MRI scanner at a single institution, and external validation was not performed. Because the number of installed PET/MRI scanners is limited, external validation with multiple PET/MRI scanners is difficult. Because both CycleGAN and U-GAT-IT are image conversion techniques based on unsupervised learning, the effect of overfitting is expected to be low. Fourth, it was difficult to obtain the ground truth after conversion owing to the different positions and breathing conditions during PET/MRI and CT imaging, and it was therefore difficult to quantitatively evaluate the effect of the synthesised CT on PET; further studies are required to address this issue.

Conclusions
The combination of U-GAT-IT and MIND was effective in preventing anatomical inconsistencies between ZTE and synthesised CT and enabled the generation of clinically acceptable synthesised CT images. Our method also enables inter-modality image conversion in the chest region, which has until now been challenging to accomplish without human annotation.

Data availability
Japanese privacy protection laws and related regulations prohibit us from revealing any health-related private information, such as medical images, to the public without written consent, although the laws and related regulations allow researchers to use such health-related private information for research purposes under opt-out consent. We used the images with the approval of the ethics committee of Kobe University Hospital under opt-out consent. It is almost impossible to obtain written consent to release the data to the public from all patients. For access to our de-identified health-related private information, please contact Kobe University Hospital.

Figure 11. Boxplot of visual evaluation scores (note: small squares indicate the median score). The U-GAT-IT + MIND approach shows higher scores from all four radiologists.